Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping

Author: King Simon
Oura Keiichiro
Tokuda Keiichi
Wester Mirjam
Yamagishi Junichi
Publication venue: 'Elsevier BV'
Publication date: 01/07/2012

Crossref

Edinburgh Research Explorer

Unsupervised Cross-lingual Speaker Adaptation for HMM-based Speech Synthesis

Author: King Simon
Oura Keiichiro
Tokuda Keiichi
Wester Mirjam
Yamagishi Junichi
Publication date: 19/01/2011

In the EMIME project, we are developing a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user's voice. We integrate two techniques into a single architecture: unsupervised adaptation for HMM-based TTS using a word-based large-vocabulary continuous speech recognizer, and cross-lingual speaker adaptation for HMM-based TTS. Thus, an unsupervised cross-lingual speaker adaptation system can be developed. Listening tests show very promising results, demonstrating that adapted voices sound similar to the target speaker and that differences between supervised and unsupervised cross-lingual speaker adaptation are small.

CiteSeerX

Edinburgh Research Archive

Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural Speech Synthesis System

Author: Hashimoto Kei
Hono Yukiya
Nakamura Kazuhiro
Nankaku Yoshihiko
Oura Keiichiro
Takaki Shinji
Tokuda Keiichi
Yoshimura Takenori
Publication date: 21/11/2022

This paper integrates a classic mel-cepstral synthesis filter into a modern neural speech synthesis system towards end-to-end controllable speech synthesis. Since the mel-cepstral synthesis filter is explicitly embedded in neural waveform models in the proposed system, both the voice characteristics and the pitch of synthesized speech can be closely controlled via a frequency warping parameter and the fundamental frequency, respectively. We implement the mel-cepstral synthesis filter as a differentiable and GPU-friendly module to enable the acoustic and waveform models in the proposed system to be simultaneously optimized in an end-to-end manner. Experiments show that the proposed system improves speech quality over a baseline system while maintaining controllability. The core PyTorch modules used in the experiments will be publicly available on GitHub. Comment: Submitted to ICASSP 202

arXiv.org e-Print Archive

Recent development of the HMM-based speech synthesis system (HTS)

Author: Black Alan W
Masuko Takashi
Nose Takashi
Oura Keiichiro
Sako Shinji
Toda Tomoki
Tokuda Keiichi
Yamagishi Junichi
Zen Heiga
Publication date: 01/01/2009

A statistical parametric approach to speech synthesis based on hidden Markov models (HMMs) has grown in popularity over the last few years. In this approach, spectrum, excitation, and duration of speech are simultaneously modeled by context-dependent HMMs, and speech waveforms are generated from the HMMs themselves. Since December 2002, we have publicly released an open-source software toolkit named “HMM-based speech synthesis system (HTS)” to provide a research and development toolkit for statistical parametric speech synthesis. This paper describes recent developments of HTS in detail, as well as future release plans.

CiteSeerX

NAIST Academic Repository

Edinburgh Research Archive

Edinburgh Research Explorer

Hokkaido University Collection of Scholarly and Academic Papers

Personalising speech-to-speech translation in the EMIME project

Author: Byrne William
Dines John
Garner Philip N.
Gibson Matthew
Guan Yong
Hirsimaki Teemu
Karhila Reima
King Simon
Kurimo Mikko
Liang Hui
Oura Keiichiro
Saheer Lakshmi
Shannon Matt
Shiota Sayaka
Tian Jilei
Tokuda Keiichi
Wester Mirjam
Wu Yi-Jian
Yamagishi Junichi
Publication date: 01/07/2010

In the EMIME project we have studied unsupervised cross-lingual speaker adaptation. We have employed an HMM statistical framework for both speech recognition and synthesis, which provides transformation mechanisms to adapt the synthesized voice in TTS (text-to-speech) using the recognized voice in ASR (automatic speech recognition). An important application for this research is personalised speech-to-speech translation that will use the voice of the speaker in the input language to utter the translated sentences in the output language. In mobile environments this enhances the users' interaction across language barriers by making the output speech sound more like the original speaker's way of speaking, even if she or he could not speak the output language.

CiteSeerX

Edinburgh Research Archive

Edinburgh Research Explorer

Personalising speech-to-speech translation: Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis

Author: Byrne William
Dines John
Gibson Matthew
Hirsimäki Teemu
Karhila Reima
King Simon
Kurimo Mikko
Liang Hui
Oura Keiichiro
Saheer Lakshmi
Tokuda Keiichi
Wester Mirjam
Yamagishi Junichi
Publication venue: 'Elsevier BV'
Publication date: 01/02/2013

In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis. The application of our research is the personalisation of speech-to-speech translation, in which we employ an HMM statistical framework for both speech recognition and synthesis. This framework provides a logical mechanism to adapt synthesised speech output to the voice of the user by way of speech recognition. In this work we present results of several different unsupervised and cross-lingual adaptation approaches as well as an end-to-end speaker adaptive speech-to-speech translation system. Our experiments show that we can successfully apply speaker adaptation in both unsupervised and cross-lingual scenarios and our proposed algorithms seem to generalise well for several language pairs. We also discuss important future directions including the need for better evaluation metrics.

Infoscience - École polytechnique fédérale de Lausanne

Edinburgh Research Explorer

Thousands of Voices for HMM-Based Speech Synthesis: Analysis and Application of TTS Systems Built on Various ASR Corpora

Author: Dines John
Guan Yong
Hu Rile
Karhila Reima
King Simon
Kurimo Mikko
Oura Keiichiro
Tian Jilei
Tokuda Keiichi
Usabaev Bela
Watts Oliver
Wu Yi-Jian
Yamagishi Junichi
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/07/2010

In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an "average voice model" plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on "non-TTS" corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, Globalphone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.

Edinburgh Research Archive

Edinburgh Research Explorer